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DISCO ANALYSIS: A NONPARAMETRIC EXTENSION OF 
ANALYSIS OF VARIANCE 

By Maria L. Rizzo and Gabor J. Szekely 1 

Bowling Green State University 

In classical analysis of variance, dispersion is measured by con- 
sidering squared distances of sample elements from the sample mean. 
We consider a measure of dispersion for univariate or multivariate re- 
sponse based on all pairwise distances between-sample elements, and 
derive an analogous distance components (DISCO) decomposition for 
powers of distance in (0, 2]. The ANOVA F statistic is obtained when 
the index (exponent) is 2. For each index in (0,2), this decomposi- 
tion determines a nonparametric test for the multi-sample hypothesis 
of equal distributions that is statistically consistent against general 
alternatives. 

1. Introduction. In classical analysis of variance (ANOVA) and multi- 
variate analysis of variance (MANOVA), the ET-sample hypothesis for equal 
means is 

(1.1) H :fi 1 = --- = fi K 

vs Hi : fj,j |t/fc, for some j ^ k, where Hi,... are the means or mean 
vectors of the K sampled populations. Inference requires that random error 
is normally distributed with mean zero and constant variance (see, e.g., 
Cochran and Cox (1957), Scheffe (1953), Searle, Casella and McCulloch 
(1992), Hand and Taylor (1987), or Mardia, Kent and Bibby (1979)). 

Analysis of variance partitions the total variance of the observed response 
variable into SST (sum of squared error due to treatments) and SSE (sum 
of within-sample squared error). When the usual assumptions of normality 
and common error variance hold, under the null hypothesis distributions are 
identical, and under the alternative hypothesis distributions differ only in lo- 
cation (are identical after translation). If distributions differ in location only, 
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for a univariate response, methods based on ranks such as the nonparametric 
Kruskal-Wallis test or Mood's median test can be applied to test the hy- 
pothesis of equal population medians (see, e.g., Hollander and Wolfe (1999, 
Chapter 6)). 

In case the assumptions of normality or common variance do not 
hold, one could apply F statistics via a permutation test procedure 
(Efron and Tibshirani (1993, Chapter 15), Davison and Hinkley (1997, Chap- 
ter 4)). However, in practice, distributions with equal means may differ in 
other characteristics, while F statistics test the hypothesis (1.1) of equal 
means. 

We extend AN OVA and MANOVA to testing the more general hypothesis 
(1.2) with the help of a decomposition for other exponents than squared 
distance. 

For K independent random samples from distributions with cumulative 
distribution function (c.d.f.) F\,. . . ,Fk respectively, the if-sample hypoth- 
esis for equal distributions is 

(1.2) H :F 1 = -- = F K 

versus the composite alternative Fj ^ F^ for some 1 < j < k < K. Here each 
of the K random variables are assumed to take values in W for some integer 
p > 1, and the distributions Fj are unspecified. 

We propose a new method, called distance components (DISCO), of mea- 
suring the total dispersion of the samples, which admits a partition of the 
total dispersion into components analogous to the variance components in 
ANOVA. The resulting distance components determine a test for the more 
general hypothesis (1.2) of equal distributions. We introduce a measure of 
dispersion based on Euclidean distances between all pairs of sample ele- 
ments, for any power a of distances such that a £ (0, 2], hereafter called the 
index. The usual ANOVA decomposition of the total squared error is ob- 
tained as the special case a = 2. For all other values of the index < a < 2, 
we obtain a decomposition such that the corresponding "F" statistic deter- 
mines a test of the general hypothesis (1.2) that is statistically consistent 
against general alternatives. 

Akritas and Arnold (1994) proposed a general model for structured data 
where the distribution of the response variable is modeled in terms of distri- 
butions. A hypothesis of no treatment effect or no interaction effect is that 
the corresponding distribution term in the model is identically zero. For an 
overview, see Brunner and Puri (2001) and the references therein. 

Other distance based approaches to testing (1.1) or (1.2) have been pro- 
posed in recent literature by Gower and Krzanowski (1999) and Anderson 
(2001), with applications in ecology, economics, and genetics (McArdle and 
Anderson (2001); Excoffier, Smouse and Quattro (1992); Zapala and Schork 
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(2006)). These methods differ from our proposed approach in that they em- 
ploy the squared distance (and thus test a different hypothesis), a different 
way of decomposing the distances, or a dissimilarity measure other than 
powers of Euclidean distances. 

Our main results, the statistics for measuring distances between samples 
and the method of partitioning the total dispersion, are introduced in Section 
2. Properties of these statistics and the proposed DISCO test for the general 
hypothesis (1.2) are presented in Section 3, and DISCO decomposition for 
multi-factor models follows in Section 4. Implementation, examples, and 
empirical results are covered in Sections 5 and 6. 



2. Distance components. 



2.1. DISCO statistics. Define the empirical distance between distribu- 
tions as follows. For two samples A = {a\, . . . , a ni } and B = {b±, . . . , b n2 }, 
the d Q -distance between A and B is defined as 

d a (A,B) = J^L. [ 2 g a (A,B) — g a (A, A) — g a (B, £>)], 
m + n 2 

where 

(2-1) 9a(A,B) = VV|k-6 m || Q 

n\n 2 f-f ^— ' 

i=l m=l 

is a version of the Gini mean distance statistic and || • || denotes the Euclidean 
norm. The constant ni ," 2 is half the harmonic mean of the sample sizes. 

ni+ri2 ^ 

In the special case a = 2, the ^-distance for a univariate response vari- 
able measures variance, and there is an interesting relation between the 
^-distances and the ANOVA sum of squares for treatments. The details are 
explored below. 



Proposition 1 . Let A = {ai a ni } and B = {hi, . . . , b n2 } with means 
a and b respectively. Then 

d 2 {A, B) = 2SST = 2[m(a - c) 2 + n 2 (b - c) 2 ], 

where c = (ma + n 2 b) / (n± + n 2 ) . 



The proof of Proposition 1 is given in the Appendix. 

In the following, Ai,..., Ak are p-dimensional samples with sizes n\, . . . , 
uk respectively, and N = n\ + ■ ■ ■ + uk- 

The i^-sample d Q -distance statistic that takes the role of ANOVA sum 
of squares for treatments is the weighted sum of dispersion statistics. For 
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the balanced design with common sample size n, define the between-sample 
dispersion as 

(2.2) S a = S a (A 1 ,A 2 ,...,A K ) = ± da(Aj,A k ). 

l<j<k<K 

For unbalanced designs with sample sizes n\,n 2 ,.. ■ ,nx, for each pair of 
samples the factor 1/K = n/N in (2.2) is replaced by hjf./N, where hjk is 
the arithmetic mean of rij and n^. Thus, for the general case the between- 
sample dispersion is 



S a = S a (A 1 ,A 2 ,...,A K ) = £ f^^W^fc) 

S { ( 29a ( A J ' Ak ^ - 9a (4/ ' A i) - 3* ( Ak ' Ak ) ) } • 



l<j<k<K 

Note that if K = 2, p = 1, and a = 2, we have S 2 = d 2 (A 1 ,A 2 )/2 = SST. 

It follows from Theorem 1 in the following section that for all < a < 
2 the statistic S a determines a statistically consistent test for equality of 
distributions. 

First let us explore the relation between S 2 and SST. A well-known U- 
statistic is the sample variance S 2 . If x\, . . . , x n is a sample, then 



(2.4) ^ y ^(^-^) 2 =5 2 = -l T ^(x i -s 



i<m<n i=l 



This example is given by Serfling (1980), page 173. Notice that if A\, . . . , Ak 
have common sample size n, then (A.l) and (2.4) can be applied to compute 



K ^ " J ' (2) ^ 2 



S 2 

^<3<k<K \2J l<j<k<K 

K 

= Y2 n (o-j — a..) 2 = SST. 

3=1 

In the case of arbitrary sample sizes, the same relation holds: S 2 = SST. This 
identity is obtained as a corollary from the decomposition of total dispersion 
into the between and within components, which follows in Section 2.2. 

2.2. DISCO decomposition. Define the total dispersion of the observed 
response by 

N 

(2.5) T a = T a (A 1 ,...,A K ) = -g a (A,A), 
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where A = Ylj=i A? ^ s the P°°l e d sample and g a is given by (2.1). Similarly, 
define the within-sample dispersion statistic 



A 



(2.6) W a = W a (A 1 ,...,A K )=J2 -fgaiA^Aj). 



3=1 



Then if < a < 2, we have the decomposition T a = S a + W a , where both 
S a and W a are nonnegative. Moreover, for < a < 2, S a = if and only if 
A± = ■•■ = Ak- For the proof, we need the following definition and theorem. 

Suppose that X and X' are independent and identically distributed (i.i.d.), 
and Y and Y' are i.i.d., independent of X. If a is a constant such that 
-Ell^H < oo and .E||y|| a < oo, define the £ a -distance (energy distance) be- 
tween the distributions of X and Y as 



Theorem 1. Suppose that X and X' E W are i.i.d. with distribution F, 
Y and Y' E MP are i.i.d. with distribution G, and Y is independent of X. 
If0<a<2isa constant such that E\\X\\ a < oo and < oo, then the 

following statements hold: 

(i) e a (x,Y)>o. 

(ii) If0<a<2, then £. a (X,Y) = if and only if X = Y . 
(hi) If a = 2, then <? a (X, Y)=Q if and only if E[X] = E[Y] . 

Proof. The proof for multivariate samples is given in Szekely and Rizzo 
(2005b). Here we present a more elementary proof for the univariate case. 

First consider the case < a < 2. Using the fact that \X — Y\ a is a non- 
negative random variable, and making the substitution u = t l l a , we have 



E a (X, Y) = 2E\\X -Y\\ a - E\\X - X'\\ a - E\\Y - Y 



l\\a 



E\X-Y 



a 







Similarly, 
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Thus, 



(2.7) 2a f \u\ a ~ l {F{u)-G{u)fdu 

= 2a f |u| a_1 [F(u)(l-G(iO) + (l-F(u))G(u) 



- F(u)(l - F(u)) - G(u)(l - G(u))j du 

(2.8) = 2E\X - Y\ a - E\X - X'\ a - E\Y - Y'\ a = E a (X,Y). 

The integral (2.7) converges to a non-negative constant if \a — 1| < 1. Hence, 
(2.8) is non-negative and finite for all < a < 2. A necessary and sufficient 

condition that (2.8) equals zero is that F = G a.e., and X = Y . This proves 
(i) and (ii) for the case < a < 2. 
Finally, for the case a = 2, we have 

E 2 (X,Y) = 2E\X - Y\ 2 - E\X - X'\ 2 - E\Y - Y'\ 2 

= 2{E[X] — E[Y]) 2 > 0, 

with equality if and only if E[X) = E[Y). □ 

A consequence of Theorem 1 is that the empirical distance between sam- 
ples is always non-negative: 

Corollary 1. For all p- dimensional samples A±, . . . ,Ak, K > 2, and 
< a < 2, the following statements hold: 

(i) S a (A 1 ,...,A K )>0. 

(ii) If < a < 2, then S a (Ai, . . . , Ak) = if and only if A\ = ■■■ = Ak- 

(iii) S2{A\, . . . , Ak) = if and only if A%,. . . , Ak have equal means. 

PROOF. Let Aj = {ai, . . . , a nj } and Ak = {&i, ■ . . , b nk }. Define i.i.d. ran- 
dom variables X and X' uniformly distributed on Aj, and define i.i.d. ran- 
dom variables Y and Y' uniformly distributed on A^. Then E\\X — Y\\ a = 
g a {A v A k ), E\\X-X'\\ a = g a (A j ,A j ), E\\Y - Y'\\ a = g a (A k ,A k ), and 

J^E a (X,Y) = d a (A J ,A k ). 
n\ + ri2 

Hence, for all < a < 2, Theorem l(i) implies that S a (Aj,A k ) > 0. If < 

a < 2, then by Theorem 1 (ii) equality to zero holds if and only if X = Y (if 
and only if Aj = A k ). This proves (i) and (ii) for the case K = 2, and the 
result for K > 2 follows by induction. Statement (iii) follows from Theorem 
l(iii). □ 

Our next theorem is the DISCO decomposition of total dispersion into 
between-sample and within-sample components. 
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Theorem 2. For all integers K > 2, the total dispersion T a (2.5) of K 
samples can be decomposed as 

r a (Ai, ...,A K ) = S a (A u Ak) + W a (A u A K ), 

where S a >0 and W a > are the between- sample and within- sample mea- 
sures of dispersion given by (2.3) and (2.6), respectively. 

Proof. Let Gjk = njnkg a (Aj,A k ), and 

9jk — 9a{Aj,Ak). First consider 
the balanced design, with common sample size n. In this case (rijUk) / '(nj + 
n k) = n/2 and S a can be computed by (2.2), so that 

N 1 v - n 

T a - S a = —g(A, A) - — 2 i 2 9jk ~ 9jj ~ 9kk) 

j<k 




= 277 ^ n2 *' J ' = 2^ n2gji = 5 S = 

The proof for the general case is similar; the details are given in the Appendix. 

□ 

Corollary 2. If p = 1, then for all integers K > 2 the between- sample 
dispersion S 2 for K samples is equal to SST, and the a = 2 decomposition 
of total dispersion T 2 = S 2 + W 2 is exactly the AN OVA decomposition of the 
total squared error: SS (total) = SST + SSE . 

Proof. Applying (2.4) to the a = 2 Gini statistics shows that, for sam- 
ples Ai,...,A K , 

g 2 (A j ,A j ) = 2a], 

where a 2 = nj 1 Yli=i( a ij ~ ^-i) 2 ' J = 1 , . . - , -fiT - The within-sample sum of 
squares is Ylf=i n j&j- Similarly, the total sum of squares is Na 2 = ^g 2 (A, A). 

Thus, W 2 (Ax, ...,A K ) = ^f =1 nju) = SSE, and T 2 (A 1 ,..., A K ) = Na 2 = 
SS (total). Therefore, by the ANOVA decomposition SS (total) = SST + SSE 
and Theorem 2, we have 

S 2 (A U ...,A K )= T 2 (Ai, ...,A K )- W 2 (At, A K ), 
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hence, S2 = SST and we obtain the one-way ANOVA decomposition of total 
sum of squares. □ 

3. DISCO hypothesis tests. Assume that A±, . . . ,Ak are independent 
random samples of size n\ , . . . , nx from the distributions of random variables 
X\ , . . . , Xk respectively. 

3.1. The DISCO F a ratio for equal distributions. Analogous to the ANOVA 
decomposition, under the null hypothesis of equal distributions, S a and W a 
are both estimators of the same parameter — where X'- and Xj 

are i.i.d. The Gini mean g a (Aj , Aj ) is a biased estimator of — Aj|| a . 

An unbiased estimator of -EHAj — Xj\\ a is (nj/(rij — l))g a (Aj,Aj). Under 
the null hypothesis (1.2) we have 



Although in general D n ^ a does not have an F distribution, F a has similar 
properties as the ANOVA F statistic in the sense that F a is non-negative 
and large values of F a support the alternative hypothesis. The details of the 
decomposition can be summarized in a table similar to the familiar ANOVA 
tables. See, for example, Tables 1 and 2. 

3.2. Permutation test implementation. The DISCO test can be imple- 
mented in a distribution free way by a permutation test approach. Permuta- 
tion tests are described in Efron and Tibshirani (1993) and Davison and Hinkley 
(1997). The achieved significance level of a permutation test is exact. 

Let v = 1: N be the vector of sample indices of the pooled sample A = 
(yi), and let tt(u) denote a permutation of the elements of v. The statistic 
.Fq,^; 7r) is computed as F a {y v ^). Under the null hypothesis (1.2) the statis- 
tics F a (yi) and F a (y v ^) are identically distributed for every permutation ir 




and 




where £ = — Aj|| a . Our proposed statistic for testing equality of dis- 

tributions is 




Sg/jK - 1) 

W a /(N-KY 



of v. 
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Permutation test procedure. 

i. Compute the observed test statistic F a = F a (A;u). 

ii. For each replicate, indexed r = 1, . . . , R, generate a random permutation 
TT r = n(u) and compute the statistic Fa^ = F a (A;ir r ). 

hi. Compute the significance level (the empirical p- value) by 

A = 1 + MF*^ > Fa} = {1 + Ell HFi r) > Fa)} 

P R+l R+l 

where /(•) is the indicator function. 

The formula for p is given by Davison and Hinkley (1997, page 159), who 
state that "As a practical matter, it is rarely possible or necessary to com- 
pute the permutation P-value exactly" and "at least 99 and at most 999 
random permutations should suffice." 

3.3. Limit distribution. For all < a < 2, under the null hypothesis of 
equal distributions, d a (Aj,Ak) converges in distribution to a quadratic form 
of centered Gaussian random variables (see details in Szekely and Rizzo 
(2005a, 2005b)). Hence, under Hq the mean between-sample component 
S a /(K — 1) of the F a ratio converges in distribution to a quadratic form 
of centered Gaussian random variables. The mean within-sample compo- 
nent W a /(N — K) converges in probability to a constant by the law of large 
numbers. Therefore, for all < a < 2 by Slutsky's theorem under Hq, the 
F a ratio converges in distribution to a quadratic form 

oo 

(3.1) Q = J2^Zf, 

i=X 

where Zi are independent standard normal variables and Aj are positive 
constants. 

The DISCO test rejects (1.2) if the test statistic F a exceeds the upper 
percentile of the null distribution of F a corresponding to the significance 
level an. Szekely and Bakirov (2003) proved that for quadratic forms (3.1) 
with E[Q] = 1, 

P(Q>(^ _1 (l-«o/2)) 2 )<ao, 
for ao < 0.215, where <£(•) is the standard normal c.d.f. 

3.4. Consistency. The advantage of applying an index in (0, 2) rather 
than squared distances is that for exponents < a < 2 all types of differences 
between distributions are detected, and the test is statistically consistent. 

Theorem 3. // < a < 2, the DISCO test of the hypothesis (1.2) is 
statistically consistent against all alternatives with finite second moments. 
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Proof. Suppose that the null hypothesis is false. Then Fj ^ Fk for some 
(j, k). Let c> be an arbitrary constant. We need to prove that 

lim P(F a > c) = 1. 

N— too 

Here TV" — > oo is understood to mean that each rij — > oo and 

rij 

lim - =Pj, j = 1,...,K, 

ni,...,nK-»oo n\ + • • • + Uk 

where < p j < 1 and ^2f = iPj = 1. Then 

F(F tl > C )>F(^.*f^ .^>o 



(n i + n fc )(JV--K'), 

Statistical consistency of d Q (A,-,>lfe) for < a < 2 follows as a special case 
from Szekely and Bakirov (2003). There are constants c\ and C2 such that 

lim P(F a >c)= lim pld^A^Au) > 2c( ^ " 1)Wq 



NToo^ N^o' y" 1 ^ fa + p2)(N - K) 



lim P(d Q (A i , J 4 fc )>c 2 ) = l 

JV— >CXD 



by the statistical consistency of d a (Aj, Af~). □ 

The corresponding F2 statistic does not determine a consistent test and 
does not necessarily detect differences of scale or other characteristics. 

Remark 1. A DISCO test is applicable even when first moments do not 
exist. For any distribution such that an e-moment exists, for some e > 0, we 
can choose < a < e/2, which is sufficient for statistical consistency because 
E\\X -Y\\ 2a < 00. 

4. The DISCO decomposition in the general case. Here we use the tradi- 
tional formula notation from linear models. Let Y ~ A specify a completely 
randomized design on response Y by group variable (factor) A with a levels. 
If factor B has b levels, and interaction A:B denotes the crossed factors 
A and B with ab levels, then Y ~ A + B is the corresponding two-factor 
additive model, and Y^A*B = A + B + A:B is the two-way design with 
interaction. 

Let S(A), W(A) denote the between and within components obtained by 
a decomposition on factor A. In this section we omit the subscript a when 
the expression is applicable for < a < 2. 
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4.1. The two-way DISCO decomposition. Applying the theorem for DISCO 
decomposition to the model Y ~ A + B, we have 

T = S(A) + W(A) = S(B) + W(B), 

and, therefore, we have a decomposition 

T = S{A) + S{B) + W, 

where W is given by 

W = T - {S(A) + S(B)) = W{A) + W(B) - T. 

It is easy to check that W > 0, and that W has the form of a weighted Gini 
mean on distances between pairs of observations in cells {Ai P\Bj}, 1 <i < a, 
l<j<b. 

Similarly, we can also decompose total dispersion on factor A : B to ob- 
tain T = S(A : B) + W(A : B). The between component S(A : B) contains 
the between distances on factor A and the between distances on factor B. 
It can be shown that S(A : B) — S(A) — S(B) > by a similar argument as 
in the proof of Corollary l(i). Hence, we can obtain the decomposition 

(4.1) T = S(A) + S(B) + S{AB) + W(A:B), 
where S(AB) = S(A : B) - S(A) - S{B). 

4.2. The DISCO decomposition for general factorial designs. By induc- 
tion, it follows that for additive models with k > 1 factors and no interac- 
tions, the total dispersion can be decomposed as 

k 

T = Y J S{j) + W, 

3=1 

where W is given by 

k 

(4.2) W = Y,W(j)-(k-l)T, 

3=1 

W > 0, and W has the form of a Gini mean on distances between observa- 
tions. [For simplicity we drop the factor label and use a number to identify 
the factor in S(j) and VT(j).] 

For models with interaction terms, we proceed as in (4.1). For a factorial 
design on three factors (A,B,C), the highest order interaction is A:B:C. 
In the decomposition T = S(A :B:C) + W{A :B:C), the between compo- 
nent S(A :B:C) contains between distances for lower order terms. Define 
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Table 1 

DISCO analysis for three-factor model 



Factor 



df 



Dispersion 



A 
B 

C 

AB 
AC 
BC 
ABC 



(o-l)(6-l) 
(a-l)(c-l) 
(6-l)(c-l) 
(o-l)(6-l)(c-l) 



a- 1 
b-1 



c-1 



Sbc 




[S A /df(A)}/[W/f] 
[S B /df{B))/[W/f\ 
[S c /df(C)]/[W/f] 
[S AB /df{AB)]/[W/f] 
[S A c/df(AC)}/[W/f] 
[S BC /df(BC)]/[W/f] 



Sabc 



[S A Bc/df(ABC)}/[W/f] 



Error 



Total 



N- 1 



T 



'In the balanced design / = abc(n — 1). 

t W = W(A:B:C). 

S A = S(A), Sab = S(AB), etc. 

S{ABC) by 

S(ABC) = 5(A :B:C) - [S(A : B) + S(A : C) + 5(5 : C)] 
(4.3) +[S{A) + S{B) + S{C)] 

= S(A:B:C)- [S(AB) + S(AC) + S(BC) + S(A) + S(B) + 5(C)], 

where S(AB), S(AC), and S{BC) are defined as in (4.1). Then we obtain 
the decomposition shown in Table 1. 

Factorial designs on four or more factors are handled in a similar way, by 
obtaining W from the decomposition on the highest order interaction term, 
and splitting the between component into components corresponding to the 
terms in the model. 

Degrees of freedom are determined by the combined constraints on sums 
of distances, as in linear models. The F a ratios for the jth term with aj 
levels in an additive model are 



where df(W) equals residual degrees of freedom in the corresponding linear 



5. Implementation and examples. DISCO decomposition is easily im- 
plemented by computing the Gini sums G from the distance matrix of the 
sample for each of the cells in the model. Each of the components in the 
decomposition is a function of these sums. 



S a (j)/( aj - 1) 
W/df(W) 



model. 
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5.1. Calculation of test statistics. Consider the model Y ~ A where fac- 
tor A has a levels, corresponding to samples A\, A2, . ■ ■ , A a . If V is the N x N 
distance matrix of the sample, let M be the N x a design matrix defined by 

M = (M (j ) = (/{x ie ^}) = {j : *l^L. 

Then Q = M T DM is the oxo matrix of Gini sums G(Ai, Aj), and the 
within-sample sums are along the diagonal of Q. Thus, both T and W(.A) 
are easily computed from Q, and S(A) = T — W(A). 

Remark 2. The design matrix M has no intercept column and has one 
column for each level of factor A, unlike the matrix used to fit a linear model 
in most software packages. For a one-way layout the matrix M is easily 
obtained by software; for example, the R model. matrix function returns the 
required matrix M for the formula Y ~ + A (no intercept model) . 

It is clear from (4.1)-(4.3) and the example in Table 1 that all of the 
required distance components for any given model can be computed by ex- 
panding the model formula to additive form and iteratively computing the 
decomposition on each term. 

The calculations for a multivariate response or general a differ only in 
the initial step to compute the distance matrix T>. 

The DISCO test can be implemented as a permutation test, as outlined in 
Section 3.2. We have implemented DISCO tests in the statistical computing 
software R (R Development Core Team (2009)). The methods implemented 
in this paper are available in the disco or energy package for R (Rizzo and 
Szekely (2009)). 

5.2. Application: decomposition of residuals. Suppose we consider the 
residuals from a fitted linear model on a univariate response with one fac- 
tor. Denote the fitted model L. Regardless of whether the hypothesis of equal 
means is true or false, the residuals do not reflect differences in means. If 
treatments differ in some way other than the mean response, then the differ- 
ences can be measured on the residuals by distance components, < a < 2. 
If we consider models of the type proposed by Akritas and Arnold (1994), 
we could regard the linear portion L for treatment effect as an "intercept" 
term. That is, 

a 

F j (x) = L(x) + R j (x), ^2Rj(x) = 0, 

i=i 

where Fj is the distribution function of Xij,i = 1, . . . ,rij. If all Rj(x) = 0, 
then Fj = L for every j. One can test the hypothesis Hq : all Rj(x) = by 
testing the sample of residuals of L for equal distributions. 
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Fig. 1. Gravity data and residual plots in Example 1 (sample sizes 8, 11, 9, 8, 8, 11, 13, 
and 13). 



The following example illustrates our Theorems 1 and 2. Then DISCO 
decomposition is applied to the residuals. 

Example 1 (Gravity data). The gravity data consist of 81 measure- 
ments in a series of eight experiments conducted by the National Bureau of 
Standards in Washington DC between May, 1934 and July, 1935, to estimate 
the acceleration due to gravity at Washington. Each experiment consisted 
of replicated measurements with a reversible pendulum expressed as devia- 
tions from 980 cm/sec 2 . The data set {gravity) is discussed in Example 3.2 
of Davison and Hinkley (1997) and is available in the boot package for R 
(Canty and Ripley (2009)). Boxplots of the data in Figure 1 reveal noncon- 
stant variance of the measurements over the series of experiments. 

The decompositions by series for a = 1 and a = 2 are shown in Table 2. 
Note that when index a = 2 is applied, the DISCO decomposition is exactly 
equal to the ANOVA decomposition, also shown in Table 2. In fact, with 
our implementation as random permutation test, the F2 test is actually a 
permutation test based on the ANOVA F statistic. In this example 999 
permutation replicates were used to estimate the p-values. 

Residual plots from the fitted linear model (ANOVA) are shown in Figure 
1, indicating that residuals have non- normal distribution and nonconstant 
variance. When we decompose residuals by Series using DISCO (a = 1) as 
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Table 2 

Comparison of DISCO and ANOVA decompositions in Example 1 

DISCO 

Distance Components: index 1.00 

Source Df Sum Dist Mean Dist F-ratio p-value 

Between: 

Series 7 100.62287 14.37470 2.781 0.001 

Within 73 377.27836 5.16820 

Distance Components: index 2.00 

Source Df Sum Dist Mean Dist F-ratio p-value 

Between: 

Series 7 2818.62413 402.66059 3.568 0.002 

Within 73 8239.37587 112.86816 

ANOVA 

Analysis of Variance Table 
Response : Gravity 

Df Sum Sq Mean Sq F value Pr(>F) 
Series 7 2818.6 402.7 3.5675 0.002357 [0.002 by perm, test] 

Residuals 73 8239.4 112.9 



shown in Table 3, the DISCO F\ statistic is significant (p-value < 0.05). We 
can conclude that the residuals do not arise from a common error distribu- 
tion. (The ANOVA F statistic is zero on residuals.) 

The next example illustrates decomposition of residuals for a multivariate 
response. 

Example 2 (Iris data). Fisher's (or Anderson's) iris data set records 
four measurements (sepal length and width, petal length and width) for 
50 flowers from each of three species of iris. The species are iris setosa, 
versicolor, and virginica. The data set is available in R (iris). The model 
is Y ~ Species, where Y is a four dimensional response corresponding to 
the four measurements of each iris. The DISCO F\ and MANOVA Pillai- 
Bartlett F test, implemented as permutation tests, each have p-value 0.001 

Table 3 

Distance Components of ANOVA residuals in Example 1 
Distance Components: index 1.00 

Source Df Sum Dist Mean Dist F-ratio p-value 

Between: 

Series 7 56.66334 8.09476 1.566 0.046 

Within 73 377.27836 5.16820 
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Table 4 

Analysis of iris data and residuals in Example 2 

DISCO analysis of multivariate iris data: 
Distance Components: index 1.00 
Source Df Sum Dist Mean Dist 

Between: 

Species 2 119.23731 59.61865 

Within 147 70.33848 0.47849 

MANOVA analysis of multivariate iris data: 

Df Pillai approx F num Df den Df 
Species 2 1.192 53.466 8 290 

Residuals 147 
[permutation test p = 0.001] 

DISCO analysis of residuals of linear model for iris data: 
Distance Components: index 1.00 



Source 


Df 


Sum Dist 


Mean Dist 


F-ratio 


p-value 


Between: 












Species 


2 


1.69845 


. 84923 


1.775 


0.039 


Within 


147 


70 . 33848 


. 47849 







based on 999 permutation replicates. The residuals from the fitted linear 
model are a 150 x 4 data set. 

Results of the multivariate analysis are shown in Table 4. From the DISCO 
decomposition of the residuals and test for equality of distributions of resid- 
uals (p-value < 0.04), it appears that there are differences due to Species 
that are not explained by the linear component of the model. 

5.3. Choosing the index a. Choice of a test or a parameter for a test is 
a difficult question. Consider the similar situation one has with the choice 
of Cramer-von Mises tests, an infinite class of statistics that depend on 
the choice of weight function. For testing normality, for example, one can 
use the identity weight function (Cramer-von Mises test) or weight func- 
tion F(x)(l — F(x)) (Anderson-Darling test) and both are good tests with 
somewhat different properties. Here we have a similar choice. 

The simplest and most natural choice is a = 1 corresponding to Euclidean 
distance. It is natural because it is at the center of our interval for a. Consid- 
ering implementation for a univariate response, when a = l the Gini means 
can be linearized, which reduces the computational complexity from 0(N 2 ) 
to 0{N\og(N)). 

For heavy-tailed distributions one may want to apply a small a, which 
could be selected based on the data. As an example, consider the Pareto 



F-ratio p-value 
124.597 0.001 



Pr(>F) 
< 2.2e-16 *** 
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distribution with density f{x) = ka k /x k+l , x > a. In this case E[X] exists 
only for k > 1 and Var(AT) is finite only for k > 2. Note that X a has a Pareto 
distribution for a > 0. If one is comparing claims data, which Pareto models 
tend to fit well, the tail index k can be estimated by maximum likelihood to 
find a conservative choice of a such that the second moments of X a exist. 
Heavy-tailed stable distributions such as Levy distributions used in financial 
modeling suggest another situation where a < 1 may be recommended. 

6. Simulation results. In this section we present the results of Monte 
Carlo studies to assess power of DISCO tests. In our simulations R = 199 
replicates are generated for each DISCO test decision. 

Examples 3 and 4 compare DISCO with two parametric MANOVA tests 
based on Pillai (1955) and Wilks (1932) statistics (see, e.g., Anderson (1984, 
Chapter 8)). The Pillai-Bartlett test implemented in R is recommended by 
Hand and Taylor (1987). 

Example 3. The multivariate response is generated in a four group bal- 
anced design with common sample size n = 30. The marginal distributions 
are independent with Student i(4) distributions. Sample 1 is noncentral i(4) 
with noncentrality parameter 5. Samples 2-4 each have central t{4) distri- 
butions. The index applied in the DISCO test is 1.0. 

Results of several simulations are summarized in Figure 2(a) and (b) at 
significance level 0.10. In Figure 2(a) the noncentrality parameter is on the 
horizontal axis and dimension is fixed at p = 10. In Figure 2(b) the di- 
mension is on the horizontal axis and <5 = 0.2 is fixed. Each test achieves 
approximately the nominal significance level of 10% under the null hypoth- 
esis [see Figure 2(a) at <5 = 0]. Standard error of the estimate of power is at 
most 0.005, based on 10,000 tests. 

Results displayed in Figure 2(a) and (b) suggest that the DISCO test is 
slightly more powerful than MANOVA tests against this alternative when 
p = 10. As dimension increases, Figure 2(b) illustrates that the DISCO test 
is increasingly superior relative to MANOVA tests. 

The MANOVA tests apply a transformation to obtain an approximate 
F statistic. Although the data is non-normal, the MANOVA test statistics 
appear to be robust to non-normality in this example and exhibit good power 
when p = 10. This simulation suggests that the transformation may not be 
applicable for test decisions when dimension is large relative to number of 
observations. For comparison with MANOVA tests, dimension is constrained 
by sample size. Note, however, that the DISCO test is applicable in arbitrary 
dimension regardless of sample size. 

Example 4. In this example we again consider a balanced design with 
four groups and n = 30 observations per group. Groups 2-4 have i.i.d. 
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(a) (b) 

Fig. 2. Monte Carlo results for Example 3: Empirical power of the DISCO and MANOVA 
tests against a t(4) alternative, four groups with n — 30 per group, where (a) dimension 
p = 10 and noncentrality parameter 8 varies and (b) p varies and 5 = 0.2. Standard error 
of power estimate is at most 0.005. 



marginal Gamma(shape = 2, rate = 0.1) distributions. Group 1 is also 
Gamma(shape = 2, rate = 0.1), but with multiplicative errors distributed as 
Lognormal(/i = 0,0"). Thus, the natural logarithm of the group 1 response 
has an additive normally distributed error with mean and variance a 2 . 
The index applied is 1.0. 

Results for significance level 10% are summarized in Figures 3(a) and 
(b). Each test achieves approximately the nominal significance level of 10% 
under the null hypothesis [see Figure 3(a) at a = 0]. Standard error of the 
estimate of power is at most 0.005, based on 10,000 tests. 

In Figure 3(a) the parameter a is on the horizontal axis and dimension is 
fixed at p = 10. Each test exhibits empirical power increasing with a in Fig- 
ure 3(a), but the DISCO test is clearly more powerful than the MANOVA 
tests against this alternative. In Figure 3(b) the dimension is on the hori- 
zontal axis and a = 0.4 is fixed. This simulation reveals increasingly superior 
power performance of DISCO as dimension increases. 

7. Summary. The distance components decomposition of total disper- 
sion is analogous to the classical decomposition of variance, but generalizes 
the decomposition to a family of methods indexed by an exponent in (0,2]. 
The ANOVA and MANOVA methods are extended by choosing an index 
strictly less than 2, for which we obtain a statistically consistent test of the 
general hypothesis of equal distributions. DISCO tests can be applied in 
arbitrary dimension, which is not constrained by number of observations. 
The usual assumption of homogeneity of error variance is not required for 
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(a) (b) 

Fig. 3. Monte Carlo results for Example 4-' Empirical power of the DISCO and MANOVA 
tests against a gamma(shape — 2, rate —0.1) alternative, four groups with n — 30 per 
group, where (a) dimension p — 10 and a varies (b) p varies and a = 0.4. Standard error 
of power estimate is at most 0.005. 



DISCO tests, and the distribution of errors need not be specified except for 
the mild condition of finite variance. Moreover, the DISCO permutation test 
implementation is nonparametric and does not depend on the distributions 
of the sampled populations. 

APPENDIX A 

A.l. Proof of Proposition 1. The total sum of squared distances can be 
decomposed as 

XT ~ & m| 2 = ^ ^ \<H ~ a + a - b m \ 2 

m=l i=l m=l i=l 

= ^2 fai^i +ni(a- b m ) 2 } 

m=l 

n 2 

= n\nio\ + n\ 2J (a — b + b — b m ) 2 

m=l 

= nxn 2 \a\ + of + (o - b) 2 ], 
where a\ = (1/m) EZM ~ <*? and °l = 0-M ~ W Similarly, 

n\ ni H2 n2 

^2 {en - a m \ 2 = 2n\a\ and |bj - 6 m | 2 = 2ri|of , 

m=l i=l m=l i=l 
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so that 

<h(A,B) 
(A.l) 



nin 2 


2 


m + n 2 


nin 2 



/*2 i -2 i /- 7\2\ 1 2-2 1 2-2 
-7*1712(0-! + CT 2 + (a - 0) ) 2 n l a ± 2 n 2 a 2 



in. 



1n x n 2 - 2 
(a-b) . 



The well-known identity 

(A.2) n^a-c) 2 + n 2 {b-c) 2 = UlU2 (a-b) 2 

n\ + n 2 

follows from a — c = n 2 (a — b)/(rii + n 2 ) and b — c = ni(b — a)/[ni + 7*2). 
Hence, d 2 (A B) = 2m (a - c) 2 + 2n 2 (6 - c) 2 = 255T. 

A.2. Proof of Theorem 2. One can obtain the DISCO decomposition 
by directly computing the difference between the total and within-sample 
dispersion. Given p-dimensional samples Ai,... ,Ak with respective sample 
sizes ni,...,riK and N = Ylj n ji ^ 9jk = 9a(Aj,Ak) given by (2.1) and 
Gjk = njUkgjk, for j, k = 1, . . . , K. Then for all < a < 2 and p > 1, 



W a = ^g(A,A)-^Y, n j9jj 

j 

= f (E ^ g ^ + E - ^ E 
= (E 2G j« + E - 2 E 

^v- V^T^ J (2 * fe " ^ " 9kk) • 2tE "^"^ 

+^?E n J ( nfe ^ fc ) + on E - 9 E • 



2iV ' " y 2N^ 2 

i<fc j j 



After simplification we have 
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